Data Management Strategies

Kim Cressman and Gabriel Kamener

Catbird Stats, LLC (KC); Florida International University (GK)

2023-11-16

Goals for this presentation

  • NOT to shame you
  • Give you some knowledge to build better datasets
    • some = a manageable amount
    • …moving forward
  • Point you toward helpful resources

Spreadsheets are useful

Even I will not fight that fact

There are multiple purposes for keeping tabular data in spreadsheets:

  • Data entry
  • Data storage
  • Data analysis
  • Presentation

Spreadsheets can be dangerous

Things to be careful of

  • Proprietary software
  • Repeating data on many rows can lead to unnoticed mistakes

Repetition mistakes

data in a spreadsheet

Repetition mistakes

data in a spreadsheet, with two rows outlined in orange

Repetition mistakes

data in a spreadsheet, two rows outlined in orange, one of those rows highlighted in yellow. the data in those two rows should be the same because it references the same sampling plot, but the data in the highlighted line was accidentally pasted from the plot above.

Things to be careful of

  • Proprietary software
  • Repeating data on many rows can lead to unnoticed mistakes
  • Excessive formula use can eventually lead to unseen errors

Excel and dates

Oprah "you get a car" meme but with dates

  • dates “seen” differently in the computer than in our brains
  • dates seen differently based on operating system

Best practices

Helpful sources

Rectangles

  • One table per sheet
  • When adding data, add rows, not columns
  • One type of data per column
    • don’t type “No Data” in an otherwise numeric column
  • Be thoughtful about column names
    • and don’t use special characters in them
  • Be thoughtful about representation of missing data

Make information explicit

  • QA/QC columns, rather than comments on a cell
  • Additional columns, rather than [only] color coding

Different tables for different data types

  • Think: site information you only measure once (lat/long, habitat type, etc.) vs. information you measure every time


Different tables for different data types

  • Think: site information you only measure once (lat/long, habitat type, etc.) vs. information you measure every time
  • Tables can be related to each other via common columns, known as “keys”
    • can even do this in Excel, with XLOOKUP

Example: fish monitoring data

Example: fish monitoring data

Data safety

  • Don’t do any calculations in the raw data file!
    • Make a copy.
  • Back up your data!
    • Keep it in 3 places
    • At least one in a different physical location

Documentation (Metadata)

  • Who, what, when, where, why
  • How
  • Data dictionary
  • No universal standard, but several formats exist
  • JUST DO IT

When to move beyond spreadsheets

Wait, what’s “beyond” a spreadsheet?

Relational database!

---------------------------------

“All relational databases organize data into sets of interlinked tables.” -Thomer and Wickett 2020 (open access)

---------------------------------

“A database is, in some sense, just a collection of tables, where there’s some value in the tables that allows them to be connected to each other (the ‘related’ part of ‘relational database’).” -Data Carpentry ‘Data Management with SQL for Ecologists’ workshop

Some software examples

  • Access

  • Oracle

  • MySQL

Advantages

(of a well-built relational database)

  • “Front end” / “Back end”

    • data entry is (can be) human-friendly

    • data validation

    • data storage is computer-friendly

    • all the linkages happen without you having to think about them

Advantages

(of a well-built relational database)

  • Queries - you can pull data back out in different ways

    • e.g., if you wanted the lat/long and habitat information associated with each individual sampling event or even individual fish

      • without the errors you’d get from copying and pasting that information into every sample row
    • WITHOUT altering the original data

Barriers

  • Not everybody has ready access to database expertise

  • Not every database is designed well

  • Good databases require thoughtful design, as well as ongoing maintenance

When to think about databases

  • Long-term projects

  • Projects involving lots of complexity

  • Projects where consistency system-wide is important

Wrap-up

  • Make it rectangular
  • Think about simple tables, and linkages
  • …even if you stay in spreadsheets
  • Make it clear what’s going on, through good naming systems and metadata

Helpful sources